Binary search trees

Bonz

This article explains the algorithms behind balanced binary search trees like those implemented by the STL's std::set class. Balanced binary trees are very interesting because they guarantee logarithmic performance: searching in a tree with ten thousand nodes will take only 20 or so steps; in other words, such trees scale very well and you can use them whenever you need to do many different lookups on a dynamic data structure (if it were static, you could as well use qsort on it and do binary searches).

I will start with an overview on unbalanced binary search trees, which however should be widely known; then I'll present two kinds of balanced binary trees, the AVL tree and the red-black tree.

Binary trees

Binary trees are a kind of precalculated binary searches. Just like a binary search works by discarding half an array at a time (either the half that is all smaller than the element being sought, or the half that is all larger), in a binary search trees each element does not only carry its own value, it also partitions the remaining elements into smaller and larger values. Here is an example of a binary tree:

                                   114
                       ___,....---'   `---...___
                      106                       122
                _,..-'   `._              _,..-'   `._
               102          110          118          126
             ,'   \       ,'   \       ,'   \       ,'   \
            100    104   108    112   116    120   124    128

Searches are pretty easy to figure out given the binary search analogy I gave above: starting from the root, if we don't find the element there we discard the right half if we're looking for a value smaller than the root, and the left half if we're looking for a value larger than the root. Now we are left with the root and half of the tree

                                   114
                       ___,....---'
                      106
                _,..-'   `._
               102          110
             ,'   \       ,'   \
            100    104   108    112

and since the left subtree is a tree in itself, we can go on recursively. So here is a simple search procedure in a binary tree:

    typedef struct tree_node_t tree_node_t;
    typedef struct tree_node_t {
      tree_node_t *tree_left, *tree_right, *tree_parent;
    }
    
    /* Simulate inheritance because I'm using C */
    struct page {
      struct tree_node_t tree_data;
      unsigned long      key;
      ... more fields ...
    }
    
    struct page *tree_search(tree_node_t **proot,
                             unsigned long key)
    {
      tree_node_t *n = *proot;
      struct page *page;
    
      while (n)
      {
         page = (struct page *) n;
    
         if (key < page->key)
            n = n->tree_left;
         else if (key > page->key)
            n = n->tree_right;
         else
            return page;
      }
      return NULL;
    }

Insertions are about as easy. We have to descend down to the bottom of the tree (stopping if we find the element along the way, because binary trees don't allow duplicates) and we'll reach a node that is pretty close to the item we're going to add. We then create a new node and make it the left or right child of the old node. Here is the result of adding 111 and 117 to the tree above; respectively, descending the tree reached the 112 and 116 nodes:

                                   114
                       ___,....---'   `---...___
                      106                       122
                _,..-'   `._              _,..-'   `._
               102          110          118          126
             ,'   \       ,'   \       ,'   \       ,'   \
            100    104   108    112   116    120   124    128
                                /        \
                             111         117

And here is the code:

    struct page *tree_insert(struct tree_node_t **proot,
                             unsigned long key,
                             tree_node_t *node)
    {
      tree_node_t **p = proot;
      tree_node_t *parent = NULL;
      struct page *page;

      while (*p)
      {
         parent = *p;
         page = (struct page *) parent;

         if (key < page->key)
            p = &(*p)->tree_left;
         else if (key > page->key)
            p = &(*p)->tree_right;
         else
            return page;
      }

      node->tree_parent = parent;
      node->tree_left = node->tree_right = NULL;
      *p = node;

      return NULL;
    }

Note that p, a pointer to a pointer to a tree, is effectively pointing to an arc connecting two nodes: the starting node is fixed, the ending node can be modified by assigning to *p. There is also an "arc" leading to the root.

Deleting a node is more complicated, because unless the node is a leaf we must do some node shuffling to find a node that can harmlessly substitute the removed node. An always valid node is the node which is the immediate predecessor of the removed node: for example 112 if we are removing the root, 114. You can see that this node is obtained by going one level left and then right as long as it is possible; then the picked node takes place of the removed node (in this case its children are 106 and 122) and its left node if any (111) replaces the picked node (as child of 110). If there's no left child (like if removing 116), things are easier and the right child (117) can directly replace the removed node.

I do not provide code for deleting nodes from a binary tree for reasons that will be clear later.

We can easily traverse the tree from the smallest to the biggest item, by looking first at all the elements smaller than the root (which in turn form a tree, the left subtree of the root), then at the root itself, and then at elements larger than the root (that is, the right subtree of the root). It is not necessary to loop recursively, because you can keep a stack of the nodes that lead to the current point of the traversal, which even allows you to traverse the tree bidirectionally. If we do things carefully and add to each node a pointer to each parent (like I did in my implementation of binary trees above) we can even dispose the stack.

Compared to other structures such as arrays, binary search trees have the disadvantage of not allowing random access (i.e. access to the n-th element), but insertions and deletions should always be efficient, no matter if they're made at the beginning, middle, or end of the pre-existing data.

The problem is that sometimes binary trees can become much taller than their optimal height. Unluckily, the worst case happens when the nodes are inserted in sorted order: then we keep adding the node as the right child of the rightmost leaf, and we are actually constructing a doubly-linked list (the "previous" link comes from the parent pointer). Logarithmic performance is gone if we don't do anything to restore it.

Balancing the tree

First of all, let me point out that achieving perfect balancing like in the trees I drew above is impossible. You can balance a tree like that, but it has the same cost as sorting an array, so it is not interesting because it is an off-line algorithm (that is, it is effective if you do it once, but not if you do it every time).

We will content of an approximation: if the average depth of the nodes in our tree is twice the optimal one, it is "the same" as if we were using a search procedure on an optimal tree that took twice the time to descend a node. In other words, what matters is that we give a good bound to the average or maximum depth of the tree.

This is exactly what AVL and red-black trees do. AVL trees do more work on insertion and deletion, are considerably easier to implement, and keep the depth very close to the optimal one (41 instead of 32 for 2^32 nodes); the maximum depth for red-black trees instead is exactly twice the optimal one but the insertion and deletion are faster than in AVL trees (and a mess to implement as well...).

The operations used for minimizing tree height are said to "rebalance" the tree. These operations are always a combination of tree rotations, where tree rotation is a transform that varies the height of a tree's subtrees:

                               |             |
                               Y             X
                              / \           / \
                             X   c  ===>   a   Y
                             ^                 ^
                            a b               b c

A rotation that changes a binary tree of the form shown on the left to the form shown on the right is called a "right rotation" (or clockwise rotation) on Y: it makes the right subtree taller and the left subtree shorter. Done the other way, it is a "left rotation" (counter-clockwise) on X, which makes the left subtree taller at the expense of the right subtree.

    static void tree_rotate_left(tree_node_t *node, tree_node_t **root)
    {
      tree_node_t *right = node->tree_right;
    
      if ((node->tree_right = right->tree_left))
         right->tree_left->tree_parent = node;
      right->tree_left = node;
    
      if ((right->tree_parent = node->tree_parent))
      {
         if (node == node->tree_parent->tree_left)
            node->tree_parent->tree_left = right;
         else
            node->tree_parent->tree_right = right;
      }
      else
         *root = right;
      node->tree_parent = right;
    }
    
    static void tree_rotate_right(tree_node_t *node, tree_node_t **root)
    {
      tree_node_t *left = node->tree_left;
    
      if ((node->tree_left = left->tree_right))
         left->tree_right->tree_parent = node;
      left->tree_right = node;
    
      if ((left->tree_parent = node->tree_parent))
      {
         if (node == node->tree_parent->tree_right)
            node->tree_parent->tree_right = left;
         else
            node->tree_parent->tree_left = left;
      }
      else
         *root = left;
      node->tree_parent = left;
    }

In practice the primary advantage of red-black trees is that deletion from an AVL trees requires at most log2 n rotations, but deletion in a red-black tree never requires more than three rotations.

The AVL tree

The name of the AVL tree comes from its two inventors, two Russian mathematicians named, respectively, Adel'son-Vel'skii and Landis. The AVL rule for balancing trees is very easy: the height of the left subtree must differ from the height of the right subtree by at most one. This rule is enforced every time an insertion or deletion is made in the tree.

To implement AVL trees, therefore, we need one more field, the node's height, in the tree data structure:

    typedef struct avl_node_t avl_node_t;
    typedef struct avl_node_t {
      avl_node_t *avl_left, *avl_right, *avl_parent;
      int         avl_height;
    }

The height is zero for a leaf, while for an internal node is the maximum of the left subtree's height and the right subtree's height, plus one. Here are examples of AVL trees with, in parentheses, the height of each node:

                       4(2)                   4(3)
                      / \                    / \
                  (1)2   5(1)            (2)2   5(1)
                     ^                      ^
                 (0)1 3(0)              (1)1 3(0)
                                          /
                                      (0)0

Note how the right tree is not perfectly balanced, yet it follows the AVL rules.

The implementation's idea is to provide two functions, one to rebalance after insertion of a node, and one to delete a node. We provide deletion separately because we need to maintain some information to rebalance AVL trees after deletion, that are not needed for plain binary trees. Instead, insertion can be done simply by calling an avl_rebalance function after adding the leaf. Our insertion function hence will go like this:

    struct page *avl_insert(struct tree_node_t **proot,
                            unsigned long key,
                            tree_node_t *node)
    {
      tree_node_t **p = proot;
      tree_node_t *parent = NULL;
      struct page *page;

      while (*p)
      {
         parent = *p;
         page = (struct page *) parent;

         if (key < page->key)
            p = &(*p)->tree_left;
         else if (key > page->key)
            p = &(*p)->tree_right;
         else
            return page;
      }

      node->tree_parent = parent;
      node->tree_left = node->tree_right = NULL;
      *p = node;

      avl_rebalance (node, proot);
      return NULL;
    }

Note that the rebalancing process starts from the newly inserted leaf. I'll present the deletion algorithm now since there are no new concepts in it: it works the same as for plain trees, with a final rebalancing phase starting >from the deepest node for which we changed an arc:

    void avl_erase (avl_node_t *node, avl_node_t **tree)
    {
      avl_node_t *parent = node->avl_parent;
      
      /* Not a typo: this is the arc leading to the deleted node */
      avl_node_t **arc_to_deleted;
      avl_node_t *newnode;
      avl_node_t *deepest;
     
      if (!parent)
        arc_to_deleted = tree;
      else if (parent->avl_left == node)
        arc_to_deleted = &parent->avl_left;
      else
        arc_to_deleted = &parent->avl_right;
    
      deepest = node;
      if (!node->avl_left) {
        /* The easy way -- just remove a level from the tree */
        newnode = node->avl_right;
        if (newnode)
          newnode->avl_parent = node->avl_parent;
    
      } else {
        /* The hard way -- the node we want is the rightmost child in
           the left subtree */
    
        avl_node_t **arc = &node->avl_left;
        for (;;) {
          newnode = *arc;
          if (!newnode->avl_right) break;
          deepest = newnode;
          arc = &newnode->avl_right;
        }

        /* The trees lose a level here */
        *arc = newnode->avl_left;

        /* Fix all the links to put the node in place */
        newnode->avl_parent = node->avl_parent;
        newnode->avl_left = node->avl_left;
        newnode->avl_right = node->avl_right;
        newnode->avl_height = node->avl_height;

        if (newnode->avl_left)
          newnode->avl_left->avl_parent = newnode;
        if (newnode->avl_right)
          newnode->avl_right->avl_parent = newnode;
      }

      *arc_to_deleted = newnode;
      avl_rebalance(deepest->avl_parent, tree);
    }

Now, let's attack the rebalancing. The process starts from a deeo node and bubbles upwards, updating the height factors, performing rotations if needed and stopping at the root or when we see a node whose height was not changed by the insertion or deletion.

There are four ways the AVL rules can be violated after an insertion or deletion: I'll present the two when the the left subtree height is two more than the right height; the other two are completely symmetric, and you can simply interchange left with right:

           Y                       W
        ,-' `-.                 ,-' `-.
       W       Z      -->      V       Y
      / \                     / \     / \
     V   X                   a   b   X   Z
     ^
    a b                    Right rotation around Y

            Y                      X
         ,-' `-.                ,-' `-.
        W       Z     -->      W       Y
       / \                    / \     / \
      V   X                  V   a   b   Z
          ^
         a b               Left rotation around W
                           + Right rotation around X

The two cases can be distinguished by comparing the height of V and X: in the former case V (node_ll in the code below) is taller, in the latter X (node_lr is).

The while loop stops when the root is reached, but we can also end when the routine finds a node whose height factor did not change. This happens at the end of the routine (in the last else branch). Now here is the routine:

    #define HEIGHT(node)   (node ? node->avl_height : 0)
      
    void avl_rebalance (avl_node_t *node, avl_node_t **tree)
    {
      while (node) {
        avl_node_t **arc;
        avl_node_t *parent = node->avl_parent;
        avl_node_t *node_l = node->avl_left;
        avl_node_t *node_r = node->avl_right;
        unsigned int hgt_l = HEIGHT(node_l);
        unsigned int hgt_r = HEIGHT(node_r);
          
        if (!parent)
          arc = tree;
        else if (parent->avl_left == node)
          arc = &parent->avl_left;
        else
          arc = &parent->avl_right;
          
        if (hgt_r + 1 < hgt_l) {
          /*                                                      */
          /*                            *                         */
          /*                          /   \                       */
          /*                       n+2      n                     */
          /*                                                      */
          avl_node_t *node_ll = node_l->avl_left;
          avl_node_t *node_lr = node_l->avl_right;
          unsigned int hgt_lr = HEIGHT(node_lr);
          if (HEIGHT(node_ll) >= hgt_lr) {
            /*                                                        */
            /*                *                    n+2|n+3            */
            /*              /   \                  /    \             */
            /*           n+2      n      -->      /   n+1|n+2         */
            /*           / \                      |    /    \         */
            /*         n+1 n|n+1                 n+1  n|n+1  n        */
            /*                                                        */
            node->avl_left = node_lr;
            node->avl_height = 1 + hgt_lr;
            if (node_lr)
              node_lr->avl_parent = node;
     
            node_l->avl_right = node;
            node->avl_parent = node_l;
            node_l->avl_height = 1 + node->avl_height;
            node_l->avl_parent = parent;
            *arc = node_l;
          } else {
            /*                                                        */
            /*                *                     n+2               */
            /*              /   \                 /     \             */
            /*           n+2      n      -->    n+1     n+1           */
            /*           / \                    / \     / \           */
            /*          n  n+1                 n   L   R   n          */
            /*             / \                                        */
            /*            L   R                                       */
            /*                                                        */
            node_l->avl_right = node_lr->avl_left;
            node->avl_left = node_lr->avl_right;
            if (node_l->avl_right)
              node_l->avl_right->avl_parent = node_l;
     
            if (node->avl_left)
              node->avl_left->avl_parent = node;
     
            node_lr->avl_left = node_l;
            node_lr->avl_right = node;
            node_l->avl_parent = node->avl_parent = node_lr;
            node_l->avl_height = node->avl_height = hgt_lr;
            node_lr->avl_height = hgt_l;
            node_lr->avl_parent = parent;
            *arc = node_lr;
          }
      
        } else if (hgt_l + 1 < hgt_r) {
          /* similar to the above, just interchange 'left' <--> 'right' */
          avl_node_t *node_rr = node_r->avl_right;
          avl_node_t *node_rl = node_r->avl_left;
          unsigned int hgt_rl = HEIGHT(node_rl);
          if (HEIGHT(node_rr) >= hgt_rl) {
            node->avl_right = node_rl;
            node->avl_height = 1 + hgt_rl;
            if (node_rl)
              node_rl->avl_parent = node;
         
            node_r->avl_left = node;
            node->avl_parent = node_r;
            node_r->avl_height = 1 + node->avl_height;
            node_r->avl_parent = parent;
            *arc = node_r;
           } else {
             node_r->avl_left = node_rl->avl_right;
             node->avl_right = node_rl->avl_left;
             if (node_r->avl_left)
               node_r->avl_left->avl_parent = node_r;

             if (node->avl_right)
               node->avl_right->avl_parent = node;

             node_rl->avl_right = node_r;
             node_rl->avl_left = node;
             node_r->avl_parent = node->avl_parent = node_rl;
             node_r->avl_height = node->avl_height = hgt_rl;
             node_rl->avl_height = hgt_r;
             node_rl->avl_parent = parent;
             *arc = node_rl;
           }
     
        } else {
          unsigned int height = (hgt_l < hgt_r ? hgt_r : hgt_l) + 1;
          if (height == node->avl_height) break;
          node->avl_height = height;
        }
     
        node = parent;
      }
    }

Yes, I know, this is more than an hundred lines of code. But this is completely generic and can be used no matter what your tree hosts: you can put it in a library like the one you will find in the bonus pack, and use it freely in all your projects.

The red-black tree

The rules for red-black tree are based on those for B-Trees, the data structures used to build indices in database and high-end filesystems (not FAT; Linux's reiserfs and OS/2's HPFS use it). B-Trees are not binary anymore: a node has k children and k-1values in it which establish an k-way partition. The first child hosts the keys that are lower than all the values; the second child has the keys staying between the first and second value; and so on until the last child, which has the keys that are higher than all of the values. You establish a maximum number n of children, and the constraint is that the number of children cannot fall below n/2.

Note that, at worst, the height of a tree with m elements is log m/log (n/2), that is, the base-n/2 logarithm of m. B-trees cannot have a single child (like in the degenerate binary tree) and this is sufficient to make them always balanced!

Now let's put n=4. Adding to a 2-node (1 value, 2 possibly empty children) results in a 3-node, adding to a 3-node creates a 4-node. When you are adding to a 4-node (a node with 3 values and 4 children), you have a (temporary) 5-node that you can split into two valid nodes like this

             ___|___                 F
             B F H L              ,-' `-.
            / | | | \            B     H L
                                / |   / | \

             5-node      ==>  2-node + 3-node

What about F? It is added to the parent node: The arc that pointed to the original 4-node will point to the 2-node, and the newly acquired arc will point to the 3-node. If the parent becomes a 5-node, you can split it too, until you reach the root.

Now, red-black trees manage to turn this structure, difficult to implement efficiently (for example, the values in a node must be kept sorted!) into a binary tree. They establish two kinds of links between a parent and one of its children: a red link represents a link between two values within the same 3-node or 4-node, and a black link instead represents a link between two different B-tree nodes. In these diagrams, black links won't be drawn black for obvious reasons:

             a                  /
            / \                a        A 2-node
           B   C              / \
                             B   C

            a b                 /
           / | \               b        A 3-node
          B  C  D             / \
                             a   D
                             ^
                            B C
       
           a b c               /
          / | | \              b        A 4-node
         B  C D  E            / \
                             a   c
                             ^   ^
                            B C D E

All paths from a given node down to a leaf must cross the same number of black links (because it is a theorem that all paths from a B-tree node down to a leaf have the same height), and no node that is red-linked to its parent can be red-linked to one of its children (because the only valid red-link configurations are the two shown above).

The maximum height of a red-black tree is twice the height of the B-tree, because each node in the B-tree takes at most two levels of the red-black tree. Since the B-tree's maximum height is log2 m (see above; in this case n=2), the red-black tree's maximum height is twice the optimal height of the binary tree.

Red-black trees have interesting and easy to prove properties like this one, and this is another reason why they are often used. However, they are considerably harder to implement. Here I'll write only about the insertion algorithm, but I will show you no code; you can find it in the bonus pack, and the interface is the same as for AVL trees with avl_ replaced by rb_. Note that in the code the color of a link is stored in the child node: thus n->rb_left->rb_color is the color of the left link and n->rb_right->rb_color is the color of the right link.

The algorithm is not so difficult to follow if you do it slowly... Let's call X the leaf to which we just added a child W. If X has a black link to its parent, the link to the child must be red because X has now become a 3-node or 4-node. If X has a red link to its parent, and has no sibling, X is part of a 3-node that must become a 4-node: in this case the tree is rotated around X so that the newly-added node becomes a sibling of X's former parent.

Finally, if X was part of a 4-node (red-linked, and with a sibling Z) we have four cases, two simple and two complicated.

These are the two simple cases. Here X's parent must have two black-links, and that's all.

    Wrong RB-tree   The 5-node      The correct B-tree
    
         Y           ___|___               Y
        / \          W X Y Z           ,--' `--.
       X   Z         ^  | | |   ==>   W X       Z
       ^   ^            a b c         ^  \     / \ 
      W a b c                             a   b   c

         V           ___|___               V
        / \          U V W X           ,--' `--.
       U   X        | |  ^  |   ==>   U       W X
       ^   ^        a b     c        / \      ^  \
      a b W c                       a   b         c

And these are the two complicated cases. They are complicated because W, being the greater value in the 3-node, must become the root of the 3-node, owning a red link to X. The left child of W will become a (black) child of X.

    Wrong RB-tree   The 5-node      The correct B-tree      Final subtree
    
         Y           ___|___               Y                     W
        / \          X W Y Z           ,--' `--.                / \
       X   Z        |  ^  | |   ==>   X W       Z              X 
       ^   ^        a     b c        /  ^      / \            / \
      a W b c                       a         b   c          a
      
         V           ___|___              V                      W
        / \          U V X W          ,--' `--.                 / \
       U   X        | | |  ^    ==>  U       X W               X 
       ^   ^        a b c           / \     /  ^              / \
      a b c W                      a   b   c                 c

So we have to left-rotate W and recolor like in the rightmost part of the diagram above. Then the process is repeated: the rebalancing will continue like if X's former parent was added, until it is not part of a 4-node.

A test program

The bonus pack includes a test program with a makefile. It results in two programs called avl_test and rb_test, which respectively show AVL trees and red-black trees at work. The program will insert and remove nodes from a tree whose keys are characters, in a more or less random order, and print the tree after each pass (type Enter to go on). This program is written in C++ to show how you can use inheritance to simplify the code (plain C code is found directly in this tutorial, so...).

Acknowledgements

Thanks to Ben Pfaff for writing a whole book and library on search trees. This is available in every GNU mirror, in the avl subdirectory (but it includes red-black trees and plain trees, all in many different variations); sometimes it goes a bit academic, but it was very helpful to double check my statements, and some of the ASCII art for trees comes from there. Also thanks to Andrea Arcangeli (the biggest Italian Linux kernel hacker!) and Bruno Haible, on whose code I based my examples.

This tutorial is much more dense than I had planned; nevertheless I hope that it is useful (both in the context of coding and in the context of studying data structures, maybe at school/university). Have fun!

--
|_  _  _ __
|_)(_)| ),'
        `-._